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Abstract 

Herding and kernel herding are determinis- 
tic methods of choosing samples which sum- 
marise a probability distribution. A related 
task is choosing samples for estimating inte- 
grals using Bayesian quadrature. We show 
that the criterion minimised when selecting 
samples in kernel herding is equivalent to 
the posterior variance in Bayesian quadra- 
ture. We then show that sequential Bayesian 
quadrature can be viewed as a weighted ver- 
sion of kernel herding which achieves perfor- 
mance superior to any other weighted herd- 
ing method. We demonstrate empirically a 
rate of convergence faster than 0(1/ N). Our 
results also imply an upper bound on the em- 
pirical error of the Bayesian quadrature esti- 
mate. 



1 INTRODUCTION 

The problem: Integrals A common problem in 
statistical machine learning is to compute expectations 
of functions over probability distributions of the form: 

Z f . p = J f(x)p(x)dx (1) 

Examples include computing marginal distributions, 
making predictions marginalizing over parameters, or 
computing the Bayes risk in a decision problem. In this 
paper we assume that the distribution p(x) is known in 
analytic form, and f(x) can be evaluated at arbitrary 
locations. 

Monte Carlo methods produce random samples from 
the distribution p and then approximate the integral 
by taking the empirical mean Z = -k Y] 1 f Xn of 
the function evaluated at those points. This non- 
deterministic estimate converges at a rate 0(-i=). 



□ Herding Samples 
X SBQ Samples 




Figure 1: The first 8 samples from sequential Bayesian 
quadrature, versus the first 20 samples from herding. 
Only 8 weighted SBQ samples are needed to give an es- 
timator with the same maximum mean discrepancy as 
using 20 herding samples with uniform weights. Rela- 
tive sizes of samples indicate their relative weights. 



When exact sampling from p is impossible or impracti- 
cal, Markov chain Monte Carlo (MCMC) methods are 
often used. MCMC methods can be applied to almost 
any problem but convergence of the estimate depends 
on several factors and is hard to estimate (Cowles & 
Carlin, 1996). The focus of this paper is on quasi- 
Monte Carlo methods that - instead of sampling ran- 
domly - produce a set of pseudo-samples in a deter- 
ministic fashion. These methods operate by directly 
minimising some sort of discrepancy between the em- 
pirical distribution of pseudo-samples and the target 
distribution. Whenever these methods arc applicable, 
they achieve convergence rates superior to the C(^=) 
rate typical of random sampling. 



In this paper we highlight and explore the connec- 
tions between two deterministic sampling and integra- 
tion methods: Bayesian quadrature (bq) (O'Hagan, 
1991; Rasmussen & Ghahramani, 2003) (also known 
as Bayesian Monte Carlo) and kernel herding (Chen 
et al., 2010). Bayesian quadrature estimates integral 
(1) by inferring a posterior distribution over / condi- 
tioned on the observed evaluations f Xn , and then com- 
puting the posterior expectation of Zf p . The points 
where the function should be evaluated can be found 
via Bayesian experimental design, providing a deter- 
ministic procedure for selecting sample locations. 

Herding, proposed recently by Chen ct al. (2010), pro- 
duces pseudosamples by minimising the discrepancy of 
moments between the sample set and the target dis- 
tribution. Similarly to traditional Monte Carlo, an es- 
timate is formed by taking the empirical mean over 
samples Z = -k J^ n=1 f Xn ■ Under certain assump- 
tions, herding has provably fast, O(^) convergence 
rates in the parametric case, and has demonstrated 
strong empirical performance in a variety of tasks. 



Summary of contributions In this paper, we 
make two main contributions. First, we show that the 
Maximum Mean Discrepancy (MMD) criterion used to 
choose samples in kernel herding is identical to the ex- 
pected error in the estimate of the integral Zf_ p under 
a Gaussian process prior for /. This expected error is 
the criterion being minimized when choosing samples 
for Bayesian quadrature. Because Bayesian quadra- 
ture assigns different weights to each of the observed 
function values f(x), we can view Bayesian quadra- 
ture as a weighted version of kernel herding. We show 
that these weights are optimal in a minimax sense 
over all functions in the Hilbert space defined by our 
kernel. This implies that Bayesian quadrature dom- 
inates uniformly-weighted kernel herding and other 
non-optimally weighted herding in rate of convergence. 

Second, we show that minimising the MMD, when us- 
ing BQ weights is closely related to the sparse dictio- 
nary selection problem studied in (Krause & Cevher, 
2010), and therefore is approximately submodular 
with respect to the samples chosen. This allows us 
to reason about the performance of greedy forward se- 
lection algorithms for Bayesian Quadrature. We call 
this greedy method Sequential Bayesian Quadrature 
(sbq). 

We then demonstrate empirically the relative perfor- 
mance of herding, i.i.d random sampling, and SBQ, and 
demonstrate that SBQ attains a rate of convergence 
faster than 0(1/N). 



2 HERDING 

Herding was introduced by Welling (2009) as a method 
for generating pseudo-samples from a distribution in 
such a way that certain nonlinear moments of the 
sample set closely match those of the target distri- 
bution. The empirical mean J2n=i fx n over these 
pseudosamples is then used to estimate integral (1). 

2.1 Maximum Mean Discrepancy 

For selecting pseudosamples, herding relies on an 
objective based on the maximum mean discrepancy 
(MMD; Sriperumbudur et al., 2010). MMD measures 
the divergence between two distributions, p and q with 
respect to a class of integrand functions T as follows: 



MMDjr (p, q) 



sup 



f x p(x)dx - / f x q(x)dx 



(2) 



Intuitively, if two distributions are close in the MMD 
sense, then no matter which function / we choose from 
J 7 , the difference in its integral over p or q should be 
small. A particularly interesting case is when the func- 
tion class T is functions of unit norm from a reproduc- 
ing kernel Hilbert space (RKHS) %. In this case, the 
MMD between two distributions can be conveniently 
expressed using expectations of the associated kernel 
k(x,x') only (Sriperumbudur ct al., 2010): 



MMD% i (p,q)= sup 
fen 



f x p(x)dx- / f x q{x)dx 



,=i 



' k( X ,y) P ( x)p (y) dxd y 



(3) 
(4) 



2 J J k ^ x,y ^ p ^ q ^ dxdy 
+ J J k(x,y)q(x)q(y)dxdy, (5) 

where in the above formula fJ- P — J (f>(x)p(x)dx 6 % 
denotes the mean element associated with the dis- 
tribution p. For characteristic kernels, such as the 
Gaussian kernel, the mapping between a distribution 
and its mean element is bijective. As a consequence 
MMD^(p,ij) = if and only if p = q, making it a 
powerful measure of divergence. 

Herding uses maximum mean discrepancy to evaluate 
of how well the sample set {a:i, . . . , xn} represents the 
target distribution p: 



^herding ({^1; • ■ ■ j %N 



(6) 



k(x,y)p(x)p(y)dxdy 



1 n 1 N 

n— 1 n.m—l 



(J) 



The herding procedure greedily minimizes its objective 
^herding {{xi, ■ ■ ■ , x N }) , adding pseudosamples x n one 
at a time. When selecting the n + 1-st pseudosample: 



x n+ i <- argmin e herding ({x 1 , . . . ,x n ,x}) 



(8) 



1 - 

argmax 2E ;c ^p [fc(x,x')] — — fc(x,x m ), 



assuming k(x,x) — const. The formula (8) admits 
an intuitive interpretation: the first term encourages 
sampling in areas with high mass under the target dis- 
tribution p(x). The second term discourages sampling 
at points close to existing samples. 

Evaluating (8) requires us to compute ¥. x i^ p [k(x, x')}, 
that is to integrate the kernel against the target dis- 
tribution. Throughout the paper we will assume that 
these integrals can be computed in closed form. Whilst 
the integration can indeed be carried out analytically 
in several cases (Song et al., 2008; Chen et al., 2010), 
this requirement is the most pertinent limitation on 
applications of kernel herding, Bayesian quadrature 
and related algorithms. 



2.2 Complexity and Convergence Rates 

Criterion (8) can be evaluated in only 0(n) time. 
Adding these up for all subsequent samples, and as- 
suming that optimisation in each step has 0(1) com- 
plexity, producing N pseudosamples via kernel herding 
costs 0(N 2 ) operations in total. 

In finite dimensional Hilbert spaces, the herding al- 
gorithm has been shown to reduce MMD at a rate 
O(jj), which compares favourably with the 0{^=) 
rate obtained by non-deterministic Monte Carlo sam- 
plers. However, as pointed out by Bach et al. (2012), 
this fast convergence is not guaranteed in infinite di- 
mensional Hilbert spaces, such as the RKHS corre- 
sponding to the Gaussian kernel. 
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Figure 2: An illustration of Bayesian Quadrature. The 
function f(x) is sampled at a set of input locations. 
This induces a Gaussian process posterior distribution 
on /, which is integrated in closed form against the 
target density, p(x). Since the amount of volume under 
/ is uncertain, this gives rise to a (Gaussian) posterior 
distribution over Zf jP . 

3 BAYESIAN QUADRATURE 

So far, we have only considered integration methods 
in which the integral (1) is approximated by the em- 
pirical mean of the function evaluated at some set of 
samples, or pseudo-samples. Equivalently, we can say 
that Monte Carlo and herding both assign an equal 
weight to each of the samples. 

In (Rasmussen & Ghahramani, 2003), an alternate 
method is proposed: Bayesian Monte Carlo, or 
Bayesian quadrature (bq). bq puts a prior distribu- 
tion on /, then estimates integral (1) by inferring a 
posterior distribution over the function /, conditioned 
on the observations f(x n ) at some query points x n . 
The posterior distribution over / then implies a dis- 
tribution over Zt „. This method allows us to choose 
sample locations x n in any desired manner. See Figure 
2 for an illustration of Bayesian Quadrature. 

3.1 BQ Estimator 

Here we derive the BQ estimate of (1), after condition- 
ing on function evaluations f{x±) . . . /(xjv), denoted 
as f(X). The Bayesian solution implies a distribution 
over Zf tP . The mean of this distribution, E [Z] is the 
optimal Bayesian estimator for a squared loss. 

For simplicity, / is assigned a Gaussian process prior 
with kernel function k and mean 0. This assumption 
is very similar to the one made by kernel herding in 
Eqn. (7). 



After conditioning on f x , we obtain a closed- form pos- 
terior over /: 

p(f(x+)\f(X)) = M(f Xt l/K), cov(^, <)) (9) 

where 

f{x i< )=k{x^X)K- 1 f{X) (10) 

cov(a; 7t , =k(x±, x+) — k(x±, X)K~ 1 k(X, x±) (11) 

and K = k(X,X). Conveniently, the GP posterior 
allows us to compute the expectation of (1) in closed 
form: 



E GP \Z\ = E GP 



f(x)p{x)dx 



(12) 



f(x)p(f(x)\f(X))p(x)dxdf (13) 



= f{x)p(x)dx 



' k{x, X)p{x)dx 
= z T K-if(X) 



K-'fiX) 



(14) 

(15) 
(16) 



where 



z n = Jk(x, x n )p(x)dx = E x i~p [k(x n , x')] . (17) 

Conveniently, as in kernel herding, the desired expec- 
tation of Zf tP is simply a linear combination of ob- 
served function values f(x): 

E GP \Z\ = z T K- 1 f(X) 



,(n) 



fx 



where 



w 



{n) 

BQ 



s Tz T K~ 1 

/ j j nm 



(18) 
(19) 

(20) 



Thus, we can view the BQ estimate as a weighted ver- 
sion of the herding estimate. Interestingly, the weights 
w BQ do not need to sum to 1, and are not even neces- 
sarily positive. 

3.1.1 Non-normalized and Negative Weights 

When weighting samples, it is often assumed, or en- 
forced (as in Bach et al., 2012; Song et al., 2008), that 
the weights w form a probability distribution. How- 
ever, there is no technical reason for this requirement, 
and in fact, the optimal weights do not have this prop- 
erty. Figure 3 shows a representative set of 100 BQ 
weights chosen on samples representing the distribu- 
tion in figure 1. There are several negative weights, 
and the sum of all weights is 0.93. 

Figure 4 demonstrates that, in general, the sum of the 
Bayesian weights exhibits shrinkage when the number 
of samples is small. 
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Figure 3: A set of optimal weights given by BQ, af- 
ter 100 SBQ samples were selected on the distribution 
shown in Figure 1. Note that the optimal weights are 
spread away from the uniform weight (jj), and that 
some weights are even negative. The sum of these 
weights is 0.93. 
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Figure 4: An example of Bayesian shrinkage in the 
sample weights. In this example, the kernel width is 
approximately 1 /20 the width of the distribution be- 
ing considered. Because the prior over functions is 
zero mean, in the small sample case the weights are 
shrunk towards zero. The weights given by simple 
Monte Carlo and herding do not exhibit shrinkage. 



3.2 Optimal sampling for BQ 

Bayesian quadrature provides not only a mean 
estimate of Z/.p, but a full Gaussian posterior 
distribution. The variance of this distribution 
V [Zf iP \f Xl , . . . , f XN ] quantifies our uncertainty in the 
estimate. When selecting locations to evaluate the 
function /, minimising the posterior variance is a sen- 
sible strategy. Below, we give a closed form formula 
for the posterior variance of Zf p , conditioned on the 
observations f Xl . . . f XN , which we will denote by e^ Q . 
For a longer derivation, see Rasmussen & Ghahramani 
(2003). 



el Q (x!, ...,x N )=V [Z fiP \f Xl ,. ..,f x , 



(21) 



= Exy. P [k{x, x')\ - z T K~\ (22) 

where z„ = E x >^ p [k(x n , x')} as before. Perhaps sur- 
prisingly, the posterior variance of Zf tP does not de- 
pend on the observed function values, only on the loca- 
tion x n of samples. A similar independence is observed 
in other optimal experimental design problems involv- 
ing Gaussian processes (Krause et al., 2006). This al- 
lows the optimal samples to be computed ahead of 
time, before observing any values of / at all (Minka, 
2000). 

We can contrast the BQ objective e BQ in (22) to the 
objective being minimized in herding, e\ erding of equa- 
tion (7). Just like e1 erding , e| Q expresses a trade-off 
between accuracy and diversity of samples. On the 
one hand, as samples get close to high density regions 
under p, the values in z increase, which results in de- 
creasing variance. On the other hand, as samples get 
closer to each other, eigenvalues of K increase, result- 
ing in an increase in variance. 

In a similar fashion to herding, we may use a greedy 
method to minimise e BQ , adding one sample at a 
time. We will call this algorithm Sequential Bayesian 
Quadrature (sbq): 



x n+1 «- argmine BQ ({xi, 



,x n ,x}) (23) 



Using incremental updates to the Cholesky factor, the 
criterion can be evaluated in 0(n 2 ) time. Iteratively 
selecting TV samples thus takes 0(N 3 ) time, assuming 
optimisation can be done on 0(1) time. 

4 RELATING V [Z f>p ] TO MMD 



erding an d £ bq i S 



The similarity in the behaviour of t\ 
not a coincidence, the two quantities are closely related 
to each other, and to MMD. 



Proposition 1. The expected 
Bayesian quadrature ejL is the 



variance in 



maximum mean 



the 
dis- 



crepancy between the target distribution p and <7 B q(x) 



En=i w m'S Xn (x) 

Proof. The proof involves invoking the representor 
theorem, using bilinearity of scalar products and the 
fact that if / is a standard Gaussian process then 
V 5 :</,<?} ~^(0,N| K ): 



("); 



[Zf,p\fx! 7 ■ ■ ■ , fx N ] 



(24) 



E 



f~GP 



E 



f~GP 



= E 



f~GP 



J f(x)p(x)dx-^2wil ] f(x n )j (25) 

t . N \ 

/ (/, <t>{x)) p{x)dx - ^2 w m (/) H x n)} 

\ J n=l J 

(26) 

/, J <p{x)p{x)dx - ^2 w mH x n) ) 



1 1 A*p Mg B Q 1 1 
MMD 2 (p, q BQ ) 



(27) 
(28) 
(29) 

□ 



We know that the the posterior mean 
E GP [Ztp\f\, . . . , /at] is a Bayes estimator and 
has therefore the minimal expected squared error 
amongst all estimators. This allows us to further 



rewrite e BQ into the following minimax forms: 



BQ 



sup 

fen 

ll/ll««=i 

inf 



I f x p{x)dx - ^2 w m 

n=l 

Z-Z(f Xl , 



sup 

fen 
ll/ll«W=l 



f.ii-. 



(30) 



(31) 



inf sup 

ll/ll„«=l 



N 

/ f x p(x)dx - 2J Wnfx 



(32) 



Looking at e| Q this way, we may discover the deep 
similarity to the criterion e\ erding that kernel herding 
minimises. Optimal sampling for Bayesian quadrature 
minimises the same objective as kernel herding, but 
with the uniform ~ weights replaced by the optimal 
weights. As a corollary 



,xn) < 4ch( x i, ■ • -> x n) 



(33) 



It is interesting that e| q has both a Bayesian interpre- 
tation as posterior variance under a Gaussian process 
prior, and a frequentist interpretation as a minimax 
bound on estimation error with respect to an RKHS. 



5 SUBMODULARITY 

In this section, we use the concept of approximate sub- 
modularity (Krause & Cevher, 2010), in order to study 
convergence properties of SBQ. 

A set function s : 2 X *-> K is submodular if, for all 
A C B C X and Vx G X 

s(A U {x}) - s(A) > s(B U {x}) - s(B) (34) 

Intuitively, submodularity is a diminishing returns 
property: adding an element to a smaller set has larger 
relative effect than adding it to a larger set. A key 
result (see e. g. Krause & Cevher, 2010, and refer- 
ences therein) is that greedily maximising a submodu- 
lar function is guaranteed not to differ from the opti- 
mal strategy by more than a constant factor of (1— -). 

Herding and SBQ are examples of greedy algorithms 
optimising set functions: they add each pseudosamples 
in such a way as to minimize the instantaneous reduc- 
tion in MMD. So it is intuitive to check whether the 
objective functions these methods minimise are sub- 
modular. Unfortunately, neither Cherding, nor e BQ sat- 
isfies all conditions necessary for submodularity. How- 
ever, noting that SBQ is identical to the sparse dictio- 
nary selection problem studied in detail by Krause & 
Cevher (2010), we can conclude that SBQ satisfies a 
weaker condition called approximate submodularity. 

A set function s : 2 X H ► R is approximately submodular 
with constant e > 0, if for all A C B C X and Vx E X 

s(A U {x}) - s(A) > s{B U {x}) - s(B) - e (35) 



Proposition 2. 



~ £ bq(') * s "weakly a weakly sub- 



modular set function with constant e < 4r, where r is 
the incoherency 



k(x, x 1 ) 

max — . 
hx'ePQX y/k(x,x)k(x',x') 



(36) 



Proof. By the definition of MMD we can see that 

2 

~4 Q = inf^eR" l^p -Y,n=i w mK^ x n) is the 
negative squared distance between the mean element 
Hp and its projection onto the subspace spanned by the 
elements k(-,x n ). Substituting k = 1 into Theorem 1 
of Krause & Cevher (2010) concludes the proof. □ 

Unfortunately, weak submodularity does not provide 
the strong near-optimality guarantees as submodular- 
ity does . If s : 2 X i-» R is a weakly submodular 



function with constant e, and \A n \ 
greedy optimisation of s, then 



n is the result of 



s(A n ) > 1 - 



1 



max s(A) 

\A\<n 



(37) 



As pointed out by Krause & Cevher (2010), this guar- 
antee is very weak, as in our case the objective function 
£ bqW — £ bq(') i s upper bounded by a constant. How- 
ever, establishing a connection between SBQ and sparse 
dictionary selection problem opens up interesting di- 
rections for future research, and it may be possible 
to apply algorithms and theory developed for sparse 
dictionary selection to kernel-based quasi-Monte Carlo 
methods. 

6 EXPERIMENTS 

In this section, we examine empirically the rates of 
convergence of sequential Bayesian quadrature and 
herding. We examine both the expected error rates, 
and the empirical error rates. 

In all experiments, the target distribution p is chosen 
a 2D mixture of 20 Gaussians, whose equiprobability 
contours are shown in Figure 1. To ensure a compar- 
ison fair to herding, the target distribution, and the 
kernel used by both methods, correspond exactly to 
the one used in (Chen et al., 2010, Fig. 1). For ex- 
perimental simplicity, each of the sequential sampling 
algorithms minimizes the next sample location from a 
pool of 10000 locations randomly drawn from the base 
distribution. In practice, one would run a local opti- 
mizer from each of these candidate locations, however 
in our experiments we found that this did not make a 
significant difference in the sample locations chosen. 

6.1 Matching a distribution 

We first extend an experiment from (Chen et al., 2010) 
designed to illustrate the mode-seeking behavior of 
herding in comparison to random samples. In that 
experiment, it is shown that a small number of i. i. d. 
samples drawn from a multimodal distribution will 
tend to, by chance, assign too many samples to some 
modes, and too few to some other modes. In con- 
trast, herding places 'super-samples' in such a way as 
to avoid regions already well-represented, and seeks 
modes that are under-represented. 

We demonstrate that although herding improves upon 
i. i. d. sampling, the uniform weighting of super- 
samples leads to sub-optimal performance. Figure 1 
shows the first 20 samples chosen by kernel herding, in 
comparison with the first 8 samples chosen by SBQ. By 
weighting the 8 SBQ samples by the quadrature weights 
in (20), we can obtain the same expected loss as by us- 
ing the 20 uniformly-weighted herding samples. Figure 
5 shows MMD versus the number of samples added, on 
the distribution shown in Figure 1. We can see that in 
all cases, SBQ dominates herding. It appears that SBQ 
converges at a faster rate than (D(l/N), although the 
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Figure 5: The maximum mean discrepancy, or ex- 
pected error of several different quadrature methods. 
Herding appears to approach a rate close to 0(1/N). 
SBQ appears to attain a faster, but unknown rate. 



form of this rate is unknown. 

There are two differences between herding and SBQ: 
SBQ chooses samples according to a different criterion, 
and also weights those samples differently. We may 
ask whether the sample locations or the weights are 
contributing more to the faster convergence of SBQ. 
Indeed, in Figure 1 we observe that the samples se- 
lected by SBQ are quite similar to the samples se- 
lected by kernel herding. To answer this question, 
we also plot in Figure 5 the performance of a fourth 
method, which selects samples using herding, but later 
re-weights the herding samples with BQ weights. Ini- 
tially, this method attains similar performance to SBQ, 
but as the number of samples increases, SBQ attains a 
better rate of convergence. This result indicates that 
the different sample locations chosen by SBQ, and not 
only the optimal weights, are responsible for the in- 
creased convergence rate of SBQ. 

6.2 Estimating Integrals 

We then examined the empirical performance of the 
different estimators at estimating integrals of real func- 
tions. To begin with, we looked at performance on 100 
randomly drawn functions, of the form: 

10 

f( x ) = ^ <*ik(x, Ct) (38) 




number of samples 



Figure 6: Within-model error: The empirical error 
rate in estimating Zf iP , for several different sampling 
methods, averaged over 250 functions randomly drawn 
from the RKHS corresponding to the kernel used. 

where 

10 10 

n/iiw=EE a ^ fe ( c - c i) = 1 ( 39 ) 

That is, these functions belonged exactly to the unit 
ball of the RKHS defined by the kernel k(x, x') used to 
model them. Figure 6 shows the empirical error versus 
the number of samples, on the distribution shown in 
Figure 1. The empirical rates attained by the method 
appear to be similar to the MMD rates in Figure 5. 

By definition, MMD provides a upper bound on the 
estimation error in the integral of any function in 
the unit ball of the RKHS (Eqn. (3)), including the 
Bayesian estimator, SBQ. Figure 7 demonstrates this 
quickly decreasing bound on the SBQ empirical error. 

6.3 Out-of-model performance 

A central assumption underlying SBQ is that the in- 
tegrand function belongs to the RKHS specified by 
the kernel. To see how performance is effected if this 
assumption is violated, we performed empirical tests 
with functions chosen from outside the RKHS. We 
drew 100 functions of the form: 

10 

f(x) = ^a l exp(--( a ;- Cl ) T Sr 1 (^-c 4 ) (40) 
i=i 

where each on Cj Ej were drawn from broad distribu- 
tions. This ensured that the drawn functions had fea- 
tures such as narrow bumps and ridges which would 
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Figure 7: The empirical error rate in estimating Zf jP , 
for the SBQ estimator, on 10 random functions drawn 
from the RKHS corresponding to the kernel used. Also 
shown is the upper bound on the error rate implied by 
the MMD. 

not be well modelled by functions belonging to the 
isotropic kernel defined by k. Figure 8 shows that, 
on functions drawn from outside the assumed RKHS, 
relative performance of all methods remains similar. 

Code to reproduce all results is available at 
http : //mlg . eng . cam . ac . uk/duvenaud/ 

7 DISCUSSION 

7.1 Choice of Kernel 

Using herding techniques, we are able to achieve fast 
convergence on a Hilbert space of well-behaved func- 
tions, but this fast convergence is at the expense of the 
estimate not necessarily converging for functions out- 
side this space. If we use a characteristic kernel (Sripe- 
rumbudur ct al., 2010), such as the exponentiated- 
quadratic or Laplacian kernels, then convergence in 
MMD implies weak convergence of qjy to the target 
distribution. This means that the estimate converges 
for any bounded measurable function /. The speed of 
convergence, however, may not be as fast. 

Therefore it is crucial that the kernel we choose is 
representative of the function or functions / we will 
integrate. For example, in our experiments, the con- 
vergence of herding was sensitive to the width of the 
Gaussian kernel. One of the major weaknesses of ker- 
nel methods in general is the difficulty of setting kernel 
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Figure 8: Out-of-model error: The empirical error 
rates in estimating Zf iP , for several different sampling 
methods, averaged over 250 functions drawn from out- 
side the RKHS corresponding to the kernels used. 

parameters. A key benefit of the Bayesian interpreta- 
tion of herding and MMD presented in this paper is 
that it provides a recipe for adapting the Hilbert space 
to the observations f{x n ). To be precise, we can fit the 
kernel parameters by maximizing the marginal likeli- 
hood of Gaussian process conditioned on the observa- 
tions. Details can be found in (Rasmussen & Williams, 
2006). 

7.2 Computational Complexity 

While we have shown that Bayesian Quadrature pro- 
vides the optimal re- weighting of samples, computing 
the optimal weights comes at an increased computa- 
tional cost relative to herding. The computational 
complexity of computing Bayesian quadrature weights 
for N samples is 0(N 3 ), due to the necessity of in- 
verting the Gram matrix K(x,x). Using the Wood- 
bury identity, the cost of adding a new sample to an 
existing set is 0(N 2 ). For herding, the computational 
complexity of evaluating a new sample is only 0(A), 
making the cost of choosing N herding samples 0(N 2 ). 
For Monte Carlo sampling, the cost of adding an i.i.d. 
sample from the target distribution is only 0(1). 

The relative computational cost of computing samples 
and weights using BQ, herding, and sampling must be 
weighed against the cost of evaluating / at the sample 
locations. Depending on this trade-off, the three sam- 
pling methods form a Pareto frontier over computa- 
tional speed and estimator accuracy. When computing 



method I complexity rate 



guarantee 



MCMC 
i.i.d. MC 
herding 
SBQ 



0(N) 
O(N) 
0(N 2 ) 
0{N 3 ) 



variable 

1 

i > . > i 

/N — ~ N 

unknown 



ergodic theorem 
law of large numbers 
(Chen et al., 2010; Bach et al, 2012) 
approximate submodularity 



Table 1: A comparison of the rates of convergence and computational complexity of several integration methods. 



/ is cheap, we may wish to use Monte Carlo methods. 
In cases where / is computationally costly, we would 
expect to choose the SBQ method. When / is relatively 
expensive, but a very large number of samples are re- 
quired, we may choose to use kernel herding instead. 
However, because the rate of convergence of SBQ is 
faster, there may be situations in which the 0(N 3 ) 
cost is relatively inexpensive, due to the smaller N 
required by SBQ to achieve the same accuracy as com- 
pared to using other methods. 

There also exists the possibility to switch to a less 
costly sampling algorithm as the number of samples 
increases. Table 1 summarizes the rates of convergence 
of all the methods considered here. 

8 CONCLUSIONS 

In this paper, we have shown three main results: First, 
we proved that the loss minimized by kernel herding 
is closely related to the loss minimized by Bayesian 
quadrature, when selecting sample locations. This im- 
plies that sequential Bayesian quadrature can viewed 
as an optimally-weighted version of kernel herding. 

Second, we showed that the loss minimized by the 
Bayesian method is approximately submodular with 
respect to the samples chosen, and established connec- 
tions to the submodular dictionary selection problem 
studied in (Krause & Cevher, 2010). 

Finally, we empirically demonstrated a superior rate of 
convergence of SBQ over herding, and demonstrated a 
bound on the empirical error of the Bayesian quadra- 
ture estimate. 

8.1 Future Work 

In section 5, we showed that SBQ is approximately 
submodular, which provides only weak sub-optimality 
guarantees of its performance. It would be of interest 
to further explore the connection between Bayesian 
Quadrature and the dictionary selection problem to 
see if algorithms developed for dictionary selection can 
provide further practical or theoretical developments. 
The results in section 6, specifically Figure 5, sug- 
gest that the convergence rate of SBQ is faster than 



0(1/N). However, we are not aware of any work show- 
ing what the theoretically optimal rate is. It would be 
of great interest to determine this optimal rate of con- 
vergence for particular classes of kernels. 
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